A New Approach to Lexical Disambiguation of Arabic Text

نویسندگان

Rushin Shah

Paramveer S. Dhillon

Mark Liberman

Dean P. Foster

Mohamed Maamouri

Lyle H. Ungar

چکیده

We describe a model for the lexical analysis of Arabic text, using the lists of alternatives supplied by a broad-coverage morphological analyzer, SAMA, which include stable lemma IDs that correspond to combinations of broad word sense categories and POS tags. We break down each of the hundreds of thousands of possible lexical labels into its constituent elements, including lemma ID and part-of-speech. Features are computed for each lexical token based on its local and document-level context and used in a novel, simple, and highly efficient two-stage supervised machine learning algorithm that overcomes the extreme sparsity of label distribution in the training data. The resulting system achieves accuracy of 90.6% for its first choice, and 96.2% for its top two choices, in selecting among the alternatives provided by the SAMA lexical analyzer. We have successfully used this system in applications such as an online reading helper for intermediate learners of the Arabic language, and a tool for improving the productivity of Arabic Treebank annotators. 1 Background and Motivation This paper presents a methodology for generating high quality lexical analysis of highly inflected languages, and demonstrates excellent performance applying our approach to Arabic. Lexical analysis of the written form of a language involves resolving, explicitly or implicitly, several different kinds of ambiguities. Unfortunately, the usual ways of talking about this process are also ambiguous, and our general approach to the problem, though not unprecedented, has uncommon aspects. Therefore, in order to avoid confusion, we begin by describing how we define the problem. In an inflected language with an alphabetic writing system, a central issue is how to interpret strings of characters as forms of words. For example, the English letter-string ‘winds’ will normally be interpreted in one of four different ways, all four of which involve the sequence of two formatives wind+s. The stem ‘wind’ might be analyzed as (1) a noun meaning something like “air in motion”, pronounced [wInd] , which we can associate with an arbitrary but stable identifier like wind n1; (2) a verb wind v1 derived from that noun, and pronounced the same way; (3) a verb wind v2 meaning something like “(cause to) twist”, pronounced [waInd]; or (4) a noun wind n2 derived from that verb, and pronounced the same way. Each of these “lemmas”, or dictionary entries, will have several distinguishable senses, which we may also wish to associate with stable identifiers. The affix ‘-s’ might be analyzed as the plural inflection, if the stem is a noun; or as the third-person singular inflection, if the stem is a verb. We see this analysis as conceptually divided into four parts: 1) Morphological analysis, which recognizes that the letter-string ‘winds’ might be (perhaps among other things) wind/N + s/PLURAL or wind/V + s/3SING; 2) Morphological disambiguation, which involves deciding, for example, that in the phrase “the four winds”, ‘winds’ is probably a plural noun, i.e. wind/N + s/PLURAL; 3) Lemma analysis, which involves recognizing that the stem wind in ‘winds’ might be any of the four lemmas listed above – perhaps with a further listing of senses or other sub-entries for each of them; and 4) Lemma disambiguation, deciding, for example, that the phrase “the four winds” probably involves the lemma wind n1. Confusingly, the standard word-analysis tasks in computational linguistics involve various combinations of pieces of these logically-distinguished operations. Thus, “part of speech (POS) tagging” is mainly what we’ve called “morphological disambiguation”, except that it doesn’t necessarily require identifying the specific stems and affixes involved. In some cases, it also may require a small amount of “lemma disambiguation”, for example to distinguish a proper noun from a common noun. “Sense disambiguation” is basically a form of what we’ve called “lemma disambiguation”, except that the sense disambiguation task may assume that the part of speech is known, and may break down lexical identity more finely than our system happens to do. “Lemmatization” generally refers to a radically simplified form of “lemma analysis” and “lemma disambiguation”, where the goal is simply to collapse different inflected forms of any similarly-spelled stems, so that the strings ‘wind’, ‘winds’, ‘winded’, ‘winding’ will all be treated as instances of the same thing, without in fact making any attempt to determine the identity of “lemmas” in the traditional sense of dictionary entries. Linguists use the term morphology to include all aspects of lexical analysis under discussion here. But in most computational applications, “morphological analysis” does not include the disambiguation of lemmas, because most morphological analyzers do not reference a set of stable lemma IDs. So for the purposes of this paper, we will continue to discuss lemma analysis and disambiguation as conceptually distinct from morphological analysis and disambiguation, although, in fact, our system disambiguates both of these aspects of lexical analysis at the same time. The lexical analysis of textual character-strings is a more complex and consequential problem in Arabic than it is in English, for several reasons. First, Arabic inflectional morphology is more complex than English inflectional morphology is. Where an English verb has five basic forms, for example, an Arabic verb in principle may have dozens. Second, the Arabic orthographic system writes elements such as prepositions, articles, and possessive pronouns without setting them off by spaces, roughly as if the English phrase “in a way” were written “inaway”. This leads to an enormous increase in the number of distinct “orthographic words”, and a substantial increase in ambiguity. Third, short vowels are normally omitted in Arabic text, roughly as if English “in a way” were written “nway”. As a result, a whitespace/punctuation-delimited letter-string in Arabic text typically has many more alternative analyses than a comparable English letter-string does, and these analyses have many more parts, drawn from a much larger vocabulary of form-classes. While an English “tagger” can specify the morphosyntactic status of a word by choosing from a few dozen tags, an equivalent level of detail in Arabic would require thousands of alternatives. Similarly, the number of lemmas that might play a role in a given letter-sequence is generally much larger in Arabic than in English. We start our labeling of Arabic text with the alternative analyses provided by SAMA v. 3.1, the Standard Arabic Morphological Analyzer (Maamouri et al., 2009). SAMA is an updated version of the earlier Buckwalter analyzers (Buckwalter, 2004), with a number of significant differences in analysis to make it compatible with the LDC Arabic Treebank 3-v3.2 (Maamouri et al., 2004). The input to SAMA is an Arabic orthographic word (a string of letters delimited by whitespace or punctuation), and the output of SAMA is a set of alternative analyses, as shown in Table 1. For a typical word, SAMA produces approximately a dozen alternative analyses, but for certain highly ambiguous words it can produce hundreds of alternatives. The SAMA analyzer has good coverage; for typical texts, the correct analysis of an orthographic word can be found somewhere in SAMA’s list of alternatives about 95% of the time. However, this broad coverage comes at a cost; the list of analytic alternatives must include a long Zipfian tail of rare or contextually-implausible analyses, which collectively are correct often enough to make a large contribution to the coverage statistics. Furthermore, SAMA’s long lists of alternative analyses are not evaluated or ordered in terms of overall or contextual plausibility. This makes the results less useful in most practical applications. Our goal is to rank these alternative analyses so that the correct answer is as near to the top of the list Token Lemma Vocalization Segmentation Morphology Gloss yHlm Halam-u 1 yaHolumu ya + Holum + u IV3MS + IV + IVSUFF MOOD:I he / it + dream + [ind.] yHlm Halam-u 1 yaHoluma ya + Holum + a IV3MS + IV + IVSUFF MOOD:S he / it + dream + [sub.] yHlm Halum-u 1 yaHolumo ya + Holum + o IV3MS + IV + IVSUFF MOOD:J he / it + be gentle + [jus.] qbl qabil-a 1 qabila qabil + a PV + PVSUFF SUBJ:3MS accept/receive/approve + he/it [verb] qbl qabol 1 qabol qabol NOUN Before Table 1: Partial output of SAMA for yHlm and qbl. On average, every token produces more than 10 such analyses as possible. Despite some risk of confusion, we’ll refer to SAMA’s list of alternative analyses for an orthographic word as potential labels for that word. And despite a greater risk of confusion, we’ll refer to the assignment of probabilities to the set of SAMA labels for a particular Arabic word in a particular textual context as tagging, by analogy to the operation of a stochastic part-of-speech tagger, which similarly assigns probabilities to the set of labels available for a word in textual context. Although our algorithms have been developed for the particular case of Arabic and the particular set of lexical-analysis labels produced by SAMA, they should be applicable without modification to the sets of labels produced by any broad-coverage lexical analyzer for the orthographic words of any highlyinflected language. In choosing our approach, we have been motivated by two specific applications. One application aims to help learners of Arabic in reading text, by offering a choice of English glosses with associated Arabic morphological analyses and vocalizations. SAMA’s excellent coverage is an important basis for this help; but SAMA’s long, unranked list of alternative analyses for a particular letter-string, where many analyses may involve rare words or alternatives that are completely implausible in the context, will be confusing at best for a learner. It is much more helpful for the list to be ranked so that the correct answer is almost always near the top, and is usually one of the top two or three alternatives. In our second application, this same sort of ranking is also helpful for the linguistically expert native speakers who do Arabic Treebank analysis. These annotators understand the text without difficulty, but find it time-consuming and fatiguing to scan a long list of rare or contextually-implausible alternatives for the correct SAMA output. Their work is faster and more accurate if they start with a list that is ranked accurately in order of contextual plausibility. Other applications are also possible, such as vocalization of Arabic text for text-to-speech synthesis, or lexical analysis for Arabic parsing. However, our initial goals have been to rank the list of SAMA outputs for human users. We note in passing that the existence of set of stable “lemma IDs” is an unusual feature of SAMA, which in our opinion ought to be emulated by approaches to lexical analysis in other languages. The lack of such stable lemma IDs has helped to disguise the fact that without lemma analysis and disambiguation, morphological analyses and disambiguation is only a partial solution to the problem of lexical analysis. In principle, it is obvious that lemma disambiguation and morphological disambiguation are mutually beneficial. If we know the answer to one of the questions, the other one is easier to answer. However, these two tasks require rather different sets of contextual features. Lemma disambiguation is similar to the problem of word-sense disambiguation – on some definitions, they are identical – and as a result, it benefits from paragraph-level and documentlevel bag-of-words attributes that help to characterize what the text is “about” and therefore which lemmas are more likely to play a role in it. In contrast, morphological disambiguation mainly depends on features of nearby words, which help to characterize how inflected forms of these lemmas might fit into local phrasal structures. 2 Problem and Methodology Consider a collection of tokens (observations), ti, referred to by index i ∈ {1, . . . , n}, where each token is associated with a set of p features, xij , for the jth feature, and a label, li, which is a combination of a lemma and a morphological analysis. We use indicator functions yik to indicate whether or not the kth label for the ith token is present. We represent the complete set of features and labels for the entire training data using matrix notation as X and Y , respectively. Our goal is to predict the label l (or equivalently, the vector y for a given feature vector x. A standard linear regression model of this problem would be

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

High capacity steganography tool for Arabic text using 'Kashida'

Steganography is the ability to hide secret information in a cover-media such as sound, pictures and text. A new approach is proposed to hide a secret into Arabic text cover media using "Kashida", an Arabic extension character. The proposed approach is an attempt to maximize the use of "Kashida" to hide more information in Arabic text cover-media. To approach this, some algorithms have been des...

متن کامل

A Hybrid Approach for the Morpho-Lexical Disambiguation of Arabic

In order to considerably reduce the ambiguity rate, we propose in this article a disambiguation approach that is based on the selection of the right diacritics at different analysis levels. This hybrid approach combines a linguistic approach with a multi-criteria decision one and could be considered as an alternative choice to solve the morpho-lexical ambiguity problem regardless of the diacrit...

متن کامل

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

متن کامل

Using Fuzzifiers to Solve Word Sense Ambiguation in Arabic Language

Text mining techniques confront many challenges when dealing with the Arabic language including lexical disambiguation because Arabic is a highly inflectional and derivational language, most of the Arabic texts are devoid of diacritics especially Modern Standard Arabic (MSA), thus, it is a must to depend on the ambiguous word context under study. Two fuzzy logic classifiers have been built and ...

متن کامل

‘Repetition’ in Arabic-English Translation: The case of Adrift on the Nile

Abstract This study investigates ‘repetition’ in the English translation of the Arabic Novel, Adrift on the Nile (1993). It aims to explore the communicative functions of ‘repetition’ and to see if these functions have been maintained or lost in the process of translating the Novel. In addition, it seeks to find the translation strategies used in rendering ‘repetition’. To achieve this aim, a d...

متن کامل